International Journal of Current Research and Review 





Research Article 







DOI: http://dx.doi.org/10.31782/IICRR.2020.121920 


Semi Supervised Learning to Classify Drug Resistant Tuberculosis








IJCRR 
Section: Healthcare 


Prabu Setyaji and Preethi Subramanian


Sci. Journal Impact
Factor: 6.1 (2018)
ICV: 90.90 (2018)


Asia Pacific University (APU), Kuala Lumpur, Malaysia.







BY NC 
Copyright@IJCRR 


ABSTRACT 





Background: Health is vital to human survival, and continuous research effort is devoted to this domain.
The spread of tuberculosis increases from year to year, and the disease has threatened human life since antiquity.


Problem: The problem is complicated by co-infection with viruses such as HIV and by drug-resistant tuberculosis. The World Health
Organization has published data on drug-resistant tuberculosis that can be used to analyze the problems faced.


Objective: This paper applies a semi-supervised learning approach to prescribe recommendations based on data
analytics. Three models, namely Decision Tree, Gradient Boosting and Neural Network, are trained to predict the clusters.
Gradient Boosting performs best, with the lowest misclassification rate, and the majority cluster is identified based on its
impact and population.


Conclusion: The outcome of this analysis can provide recommendations to the health domain to reduce the spread of diseases
like tuberculosis and to improve preparedness in terms of drug production.


INTRODUCTION


The World Health Organization (WHO) is the global governing body that oversees health issues and related
concerns across countries and continents. One of the major health threats is Tuberculosis (TB), with
10 million active cases and 1.5 million deaths. Moreover, about 3 million TB cases go undiagnosed each
year. TB is one of the oldest diseases; it is caused by the bacterium Mycobacterium tuberculosis, which
infects the alveoli of the human lungs. It is estimated to kill 2 million people every year. TB can be
prevented by vaccination and treated with antibiotics; however, the problem persists, and one of the main
issues is drug resistance among TB patients. WHO stated that drug resistance arises from the misuse of
antibiotics in chemotherapy. A survey of these drug-resistant patients was not conclusive because of
inaccurate responses. The data and analysis currently published by WHO cannot estimate future
drug-resistant TB cases, which in turn makes it impossible to estimate the required drug production.

At the moment, WHO has drug-resistant TB data and analysis only up to 2017⁴. It has been stated that 56%
of the patients have not been cured, and according to the data published by the organization, the number
of drug-resistant TB patients increased from 2017 to 2018 and thereafter. Mezwa et al. added that drug
resistance can escalate into Multi-Drug Resistant (MDR) cases, where the patient is resistant to more
than one antibiotic, such as Rifampicin and Isoniazid. Khan Academy¹ has stated that these two are the
most important first-line drugs for TB, as second-line drugs are reserved for emergencies and are also
expensive. Extensively Drug-Resistant (XDR) TB is far worse, since second-line drugs would not be
effective, which implies that new drugs will be needed. The number of deaths and drug-resistant cases
needs to be predicted so that future estimates can support drug production planning. This paper focuses
on applying data analytical methods to predict future cases and perform prescriptive analytics.


MATERIALS AND METHODS


This research uses a dataset from the WHO website containing data on drug-resistant Tuberculosis⁶,⁷.
The data cover more than 200 countries for the year 2018. The CRISP-DM methodology was adopted to
manage the task.


Cluster Identification and Profiling 

The dataset provided by WHO is crude and required many data wrangling steps. Owing to the nature of
the dataset, the clusters are not of equal size, as most of the observations belong to Cluster 3.
Careful consideration was given to the choice of clustering algorithm and the number of clusters.
Nevertheless, clusters of uneven size were found to give the best result, as shown in Figure 1.





Figure 1: Cluster Diagram (segment size).
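
The paper does not name the clustering algorithm that produced the segments. As a minimal sketch, assuming a plain k-means procedure, the grouping and the resulting uneven segment sizes could look as follows. The function name `kmeans`, the starting centroids and the toy values are all invented for illustration and are not taken from the WHO data.

```python
import math
from collections import Counter


def kmeans(points, centroids, iterations=20):
    """Plain k-means on numeric tuples, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two made-up count variables per country.
toy_points = [(1, 2), (2, 1), (3, 2), (2, 3), (1, 1), (2, 2),
              (40, 45), (42, 44),
              (90, 95)]
labels, centroids = kmeans(toy_points, centroids=[(0, 0), (50, 50), (100, 100)])
segment_sizes = Counter(labels)
print(segment_sizes)
```

Even on this toy input the segments are of unequal size, which mirrors the situation described above where one cluster holds most of the observations.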


Table 1 shows the frequency of each cluster, which helps to clarify the characteristics of each
cluster. Cluster 2 has the fewest observations, so this cluster may not have a high impact. The
inter-cluster distances are given in Table 2.


Table 1: Identification of Clusters 


Clustering Criterion | Improvement | Frequency | Root-Mean-Square Deviation | Maximum | Nearest Cluster
0.238542 0 o 3 0.754512 25.35754 
0.238542 0 0 . 12.18366 
0.238542 0 0 176 0.188115 8.918836 


Table 2: Inter-Cluster Distances 





Segment | 1 | 2 | 3
1 | 0 | 95200.29 | 26427.88
2 | 95200.29 | 0 | 118329.9
3 | 26427.88 | 118329.9 | 0
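
One way to read Table 2 is to look up, for each segment, which other segment lies closest to it. The short sketch below does this with the distances copied from the table; the helper name `nearest_segment` is illustrative only.

```python
# Inter-cluster distances copied from Table 2 (segments 1 to 3).
distances = {
    1: {1: 0.0, 2: 95200.29, 3: 26427.88},
    2: {1: 95200.29, 2: 0.0, 3: 118329.9},
    3: {1: 26427.88, 2: 118329.9, 3: 0.0},
}


def nearest_segment(segment):
    """Return the other segment with the smallest distance to `segment`."""
    others = {s: d for s, d in distances[segment].items() if s != segment}
    return min(others, key=others.get)


# A distance matrix should be symmetric; check before relying on it.
is_symmetric = all(
    distances[a][b] == distances[b][a] for a in distances for b in distances
)
nearest = {s: nearest_segment(s) for s in distances}
print(is_symmetric, nearest)
```

Segments 1 and 3 are each other's nearest neighbours, while Segment 2 lies far from both.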





The modelling procedure continues by profiling the segments obtained. The profile of each cluster
highlights its characteristics and the variables with a high impact, as shown in Figure 7 in the
Appendix. Although Cluster 3 and Cluster 1 share the same important variables, the segment profile
shows that Cluster 3 holds 176 observations, which is the majority. However, this does not mean that
Cluster 1 is less important, because it identifies the DstRIt Rr new and DstRIt Rr ret variables,
which represent people who have drug-resistant TB. This detail indicates the capability of the drugs
to handle patients. Cluster 2, on the other hand, is not considered further because of its negligible
number of observations. Overall, Cluster 3 is the most significant, as it contains an important
variable, pulmlabconf ret, which identifies people who have relapsed or are still sick after previous
treatment. This will help to regulate pulmonary TB and can be noted as a finding of this analysis.


Next, data wrangling was carried out, with missing values imputed across the complete dataset using
average (mean) or mode values. The data was partitioned before being fed into the predictive models.
Three models, Decision Tree, Gradient Boosting and Neural Network, were designed to predict the
target, which is the segment variable (cluster id) created in the earlier phase.
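
The imputation and partitioning steps can be sketched as below. This is an illustration under stated assumptions, not the tool configuration used in the paper: the helper names `impute` and `partition`, the column `region`, the 75% training fraction and all values are invented.

```python
import random
from statistics import mean, mode


def impute(column):
    """Fill None with the mean (numeric column) or the mode (categorical column)."""
    present = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]


def partition(rows, train_fraction, seed=0):
    """Shuffle rows reproducibly and split them into train and validation lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical columns: the values are invented for illustration.
rr_ret = impute([10, None, 30, 20])
region = impute(["AFR", "EUR", None, "AFR"])
rows = list(zip(rr_ret, region))
train, validation = partition(rows, train_fraction=0.75)
print(rr_ret, region, len(train), len(validation))
```

Imputing before partitioning follows the order described above, where imputation is performed on the complete dataset.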





Critical Interpretation of the models 

This section focuses on interpreting the results in terms of actions that can be taken to address
the organization's business issue.


Decision Tree 

The Decision Tree produces a satisfactory result: its misclassification rate is low on the training
data but rather high on the validation data, and the validation rate is the main measure of how well
a classification model predicts. Figure 2 shows the plot of the model's misclassification rate.

Figure 2: Decision Tree Misclassification Rate

Figure 3 details the decision tree. It shows that the model can predict Cluster 3 with high
confidence, and Cluster 1 as well. One reason for the stronger prediction of Cluster 3 is the number
of observations that belong to that cluster. The decision tree results show the variables that affect
model accuracy: RR ret is the number of patients who have previously been treated for drug-resistant
Tuberculosis and whose illness is ongoing or has relapsed, and pulmlabconf ret is the number of
pulmonary Tuberculosis patients who have been treated previously.


Root node
Statistic Train Validation
1: 12.06% 12.50%
2: 1.42% 3.13%
3: 86.52% 84.38%
Count: 141 64

Split on Imputed: Rr Ret
    < 271.5 Or Missing
    >= 271.5

Node Id: 3
Statistic Train Validation
1: 75.00% 42.86%
2: 16.67% 28.57%
3: 8.33% 28.57%
Count: 12 7

Split on Imputed: Pulm Labconf Ret
    < 746.0236 Or Missing
    [746.0236, 797.5236)
    >= 797.5236

Node Id: 27
Statistic Train Validation
1: 77.78% 80.00%
2: 0.00% 0.00%
3: 22.22% 20.00%
Count: 9 5

Figure 3: Decision Tree Result.


Both of these important variables support the emphasis on preparing drugs for drug-resistant
Tuberculosis.
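
The root-node percentages reported in Figure 3 can be turned back into observation counts, which also gives the error of the simplest possible rule: always predicting the majority cluster. The sketch below is a worked check of that arithmetic; the helper name `node_counts` is illustrative.

```python
def node_counts(percentages, total):
    """Convert per-cluster percentages of a tree node into whole observation counts."""
    return {cluster: round(pct / 100 * total) for cluster, pct in percentages.items()}


# Root-node statistics as reported in Figure 3.
train_pct = {1: 12.06, 2: 1.42, 3: 86.52}
valid_pct = {1: 12.50, 2: 3.13, 3: 84.38}

train_counts = node_counts(train_pct, total=141)
valid_counts = node_counts(valid_pct, total=64)

# Always predicting the majority cluster misclassifies every other observation.
baseline_train_error = 100 - max(train_pct.values())
baseline_valid_error = 100 - max(valid_pct.values())
print(train_counts, valid_counts)
print(round(baseline_train_error, 2), round(baseline_valid_error, 2))
```

The counts add up to the reported node sizes of 141 and 64. A model is only informative to the extent that its misclassification rate falls below this majority-cluster baseline.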


Gradient Boosting 

Of the three models, Gradient Boosting performed best: it produced a simple tree structure with high
accuracy and maintained a low misclassification rate on both training and validation data, as shown
in Figure 4.

Figure 4: Gradient Boosting Misclassification Rate.





The tree produced by Gradient Boosting has more nodes than the Decision Tree, so more variables are
considered by the model. Pulm Labconf Ret retains its importance, joined by other attributes, as
shown in Figure 5. The first variable used as a split is DstRItRr Ret, which represents the number of
patients tested with first-line drugs and found to be resistant to Rifampicin. PulmLabconf New is the
number of new patients identified as having pulmonary Tuberculosis.





Split on Imputed: Dst Rit Rr Ret
    < 295 Or Missing
    >= 295

Split on Imputed: Pulm Labconf New
    < 18109.5 Or Missing
    >= 18109.5

Split on Imputed: Pulm Labconf Ret
    < 746.0236 Or Missing
    [746.0236, 797.5236)
    >= 797.5236

Node Id: 12
Statistic Train Validation
1: 66.67% 80.00%
2: 0.00%
3: 33.33%

Figure 5: Gradient Boosting Result.


These two variables appear to have a large impact on the prediction, as a newly identified
Tuberculosis patient could become a drug-resistant case. Identifying patients' resistance to one of
the first-line drugs is vital for studying the effectiveness of Rifampicin, which acts as the
first-line drug for Tuberculosis patients.
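
The paper does not describe the internals of its Gradient Boosting model. As a generic sketch of the idea only, the code below boosts single-split trees (stumps) under squared loss on a 0/1 indicator, with each stump fitted to the residuals left by the previous ones. This is a simplified variant, not the multi-class configuration used in the analysis, and the data, function names and learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that minimises squared error on residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x < threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical data: a made-up patient count per country and a 0/1 cluster flag.
xs = [50, 120, 200, 310, 420, 900]
ys = [0, 0, 1, 0, 1, 1]
errors = [mse(ys, boost(xs, ys, rounds=r)[2]) for r in (0, 1, 10)]
print(errors)
```

The training error shrinks as rounds are added, which is why tuning the number of rounds and the learning rate against validation data matters in practice.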


Neural Network 

Neural Network is the third model tested to estimate the target. Surprisingly, it has the highest
misclassification rate on validation data and behaves differently from the other two models. The
Neural Network chose rr_hivpos as its only option, a variable that never appeared in the other
techniques tried. In addition, the Neural Network predicts only the dominant clusters, Cluster 1 and
Cluster 3. The surrogate tree used to interpret the result is shown in Figure 6.


Despite the differences in the estimation, this model can still predict the clusters very well.
However, the variables it employs differ, and it focuses more on HIV, which is itself connected to
drug resistance.
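
A surrogate model approximates an opaque model with an interpretable one by learning from the opaque model's predictions instead of the true labels. The sketch below illustrates the idea with a single threshold rule. The function `black_box` is a hypothetical stand-in for a trained network, and the inputs are invented; none of this reproduces the paper's model.

```python
import math


def black_box(rr_ret):
    """Hypothetical stand-in for an opaque model; returns a cluster label."""
    score = 1 / (1 + math.exp(-(rr_ret - 250) / 40))
    return 1 if score > 0.5 else 3


def fit_surrogate_stump(values, labels, high_label, low_label):
    """Pick the threshold whose rule (x >= t gives high_label) best matches labels."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best_threshold, best_fidelity = None, -1.0
    for t in candidates:
        agree = sum(
            (high_label if v >= t else low_label) == lab
            for v, lab in zip(values, labels)
        )
        fidelity = agree / len(values)  # share of black-box outputs reproduced
        if fidelity > best_fidelity:
            best_threshold, best_fidelity = t, fidelity
    return best_threshold, best_fidelity


# Invented inputs; the surrogate is trained on the black box's outputs.
inputs = [20, 90, 150, 240, 270, 300, 410, 800]
opaque_labels = [black_box(v) for v in inputs]
threshold, fidelity = fit_surrogate_stump(
    inputs, opaque_labels, high_label=1, low_label=3
)
print(threshold, fidelity)
```

Fidelity measures agreement with the opaque model, not with the true clusters, so a surrogate can explain a model faithfully even when that model misclassifies.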


Node Id: 1
Statistic Train Validation
1: 9.22% 9.52%
3: 90.78% 90.48%
Count: 141 63

Split on Imputed: Rr Ret
    < 251 Or Missing
    >= 251

Node Id: 14
Statistic Train Validation
1: 92.31% 100.00%
3: 7.69% 0.00%
Count: 13 6

Split on Imputed: Rr Hivpos
    < 33.5 Or Missing
    >= 33.5

Node Id: 40
Statistic Train Validation
1: 20.00% 0.00%
3: 80.00% 100.00%
Count: 5 1

Figure 6: Neural Network - Surrogate Model.


Interpretation Summary 

This section summarizes the interpretation. Table 3 compares the models to identify which one is
best suited to the problem, based on misclassification rate.





Table 3: Fit Statistics 


Selected Model
M Boost | Gradient... | SEGME... | Segment...
Tree4 | Decision ... | SEGME... | Segment...
Neural | Neural N... | SEGME... | Segment...





As Table 3 shows, Gradient Boosting has the lowest misclassification rate, followed by Decision Tree
and then Neural Network, which has the highest. With this result, Gradient Boosting is the
best-fitting model for the issue discussed, namely predicting drug-resistant Tuberculosis from the
numbers of previously treated and newly identified Tuberculosis patients. It was also found that the
number of patients tested and resistant to Rifampicin, regardless of Isoniazid, is one of the
important criteria for predicting the cluster. The evidence indicates that Cluster 3 is the most
important of all, owing to its large share of the observations and because most of the variables
involved are those needed to determine the drug-resistant (RR) value.
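
Model comparison by misclassification rate can be sketched as follows. The labels and predictions are invented purely to show the calculation and are not the paper's results; only the three model names come from the text.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted cluster differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical validation labels and predictions; not the paper's results.
actual = [3, 3, 3, 1, 3, 1, 3, 2, 3, 3]
predictions = {
    "Gradient Boosting": [3, 3, 3, 1, 3, 1, 3, 3, 3, 3],
    "Decision Tree": [3, 3, 3, 1, 3, 3, 3, 3, 3, 3],
    "Neural Network": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
}

rates = {
    name: misclassification_rate(actual, pred)
    for name, pred in predictions.items()
}
best_model = min(rates, key=rates.get)
print(rates, best_model)
```

Because the clusters are heavily imbalanced, a model that predicts only the majority cluster can still show a low rate, so the comparison is most meaningful on validation data and alongside per-cluster results.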





CONCLUSION AND RECOMMENDATIONS 


Selected data analytical models were employed to classify the drug-resistant tuberculosis data based
on the clusters identified. The Gradient Boosting model was found to have the best prediction
performance, and it identified the three major variables that could explain the probable outcomes.
The model can identify drug-resistant TB patients as well as previously treated relapse patients.
With these hidden patterns uncovered in the data, health authorities would be able to predict the
future of drug-resistant TB by studying historical data. In the WHO data studied, the observations
were divided into 3 clusters, and Cluster 3 plays the major role in predicting possible incoming
drug-resistant TB cases. Based on the Cluster 3 cases, the analysis also provides details on the
countries or regions that are more vulnerable to drug-resistant TB. The analysis further reveals the
type of drug-resistant TB and the effectiveness of the lines of injectable drugs. The model can still
be improved, as Gradient Boosting was run with all default parameter settings in this analysis. A
similar analysis can also help health authorities to be better prepared for pandemics, to plan drug
production and to improve medical attention by prescribing the right drugs based on the variations
observed.

Figure 7: Segment Profile and Characteristics of Clusters.